You may submit your answers in pairs, in R (,Rmd file). Submission will be performed electronically via Moodle.

We urge you to start solving this sheet as soon as possible and, if you have any questions, come to visit us in reception hours next week.

Submit the your answeers for questions 1-4.

Question 1:

In this question we will look at the iris classic dataset (built in in r environment):

  1. Plot a box plot of the numeric features
  2. Provide two sided confidence interval for the Sepal.Length & Sepal.Width. How was that confidence interval calculated? e.g what assumptions on the mean and variance fit to this particular confidence interval.
  3. Provide a confidence interval for the probability to belong to the setosa species.
  4. Does the above confidence interval really represent the confidence interval of this setosa species among the population of all the iris species? e.g, is this dataset biased in some meaning?
  5. Plot a scatter plot such that:
* Sepal.Length will appear on the x axis 
* Sepal.Width will appear on the y axis
* Each point will have color based on the species of that given iris
* The figure will have a legend that explains the colors of the species 
  1. Look at that splendid iris -

Suppose it has Sepal.Length of 7 and Sepal.Width of 3. Based on the previous figure, which iris type would you say it is?

Question 2:

This question is an appetizer for hypothesis test. we will create a QQ plot. A QQ plot is a visual test (not a statistical test!) to check whether a one dimensional variable is distributed according according to some kind of a known distribution.

Instructions are provided in the attached link:

  1. What is a QQ plot? e.g, what appears in the x and y axis?
  2. make a QQ plot for sample1 and sample2 (with the normal distribution). are the plots the same? What’s the difference?
set.seed(0)
sample1 <- rnorm(100,100,36)
sample2 <- rnorm(100^2,100,36)
####### 
# your code here


#######

Question 3:

The breaking strength of yarn used in manufacturing drapery material is required to be at least 100 psi. Past experience has indicated that breaking strength is normally distributed and that \(\sigma = 2\) psi. A random sample of nine specimens is tested, and the average breaking strength is found to be 98 psi. Find a 95% two-sided confidence interval on the true mean breaking strength.

Question 4:

In order to find the proportion of defective bottles produced by a machine, two samples were taken. In sample 1, out of \(n_1\) bottles \(d_1\) werefound to be defective and in sample 2 out of \(n_2\) bottles \(d_2\) werefound to be defective.

Hereby there are 3 different estimators for the proportion of defective bottles:

  1. \(\hat{p}_{1}=\frac{d_{1}}{n_{1}}\)
  2. \(\hat{p}_{2}=\frac{d_{1}+d_{2}}{n_{1}+n_{2}}\)
  3. \(\hat{p}_{3}=\frac{1}{2}\cdot\left(\frac{d_{1}}{n_{1}}+\frac{d_{2}}{n_{2}}\right)=\frac{d_{1}}{2 n_{1}}+\frac{d_{2}}{2 n_{2}}\)

For each one of the above estimators. Check if it is unbiased and calculate the MSE (mean squared error) of that estimator.

Question 5:

Read chapter 5 in r4ds and do excercises 5.24 - 5.41.